Introduction

The rapid global trend toward urbanization poses significant challenges for ensuring that rural populations are not left behind in development and policy considerations. Tracking vital statistics for these populations is increasingly difficult, leading to a shortage of reliable, up-to-date data. Private data, such as social media data, has shown the potential to complement or supplement the existing public data. This study explores using Facebook Advertising data, which offers a digital “census” of over two billion users, to measure and analyze rural-urban inequalities.

This vignette utilizes rSocialWatcher (Marty, 2020) to query Facebook Marketing data to understand divides between rural and urban behaviors in Sub-Saharan Africa. Facebook data is utilized to analyze statistically significant differences between chosen behaviors in urban and rural regions in Nigeria. This vignette is based on Facebook Ads as a Demographic Tool to Measure the Urban-Rural Divide (Daniele Rama et al., 2020) which similarly uses Facebook Marketing data to study the urban-rural divide in Italy.

Setup

## Load packages
library(rsocialwatcher)

library(knitr)

library(htmlwidgets)
library(htmltools)
library(jsonlite)
library(stringr)

library(dplyr)
library(tidyr)

library(viridis)
library(RColorBrewer)
library(ggplot2)
library(leaflet)
library(kableExtra)
library(ggpubr)

library(WDI)
library(sf)
library(raster)
library(exactextractr)
library(rnaturalearth)
library(rnaturalearthdata)

TOKEN      <- "Enter API Token Here"
VERSION <- "Enter Version Here"
CREATION_ACT <- "Enter Creation Act Here"

Facebook Population Data in sub-Saharan Africa

To thoroughly explore and understand the prevalence of Facebook usage across Sub-Saharan Africa, it is essential to analyze the adult population data for each country within the region. The total adult population data can be obtained from the World Development Indicators (WDI). We will then query the total adult Facebook population for each country to calculate the percentage of Facebook users per country. This approach will provide a comprehensive view of Facebook’s penetration among adults in Sub-Saharan Africa, allowing us to draw meaningful insights and make informed decisions based on social media usage trends in the region.

sub_saharan_countries <- c("AGO", "BEN", "BWA", "BFA", "BDI", "CPV", "CMR", "CAF", "TCD", 
                           "COM", "COG", "CIV", "COD", "DJI", "GNQ", "ERI", "SWZ", "ETH", 
                           "GAB", "GMB", "GHA", "GIN", "GNB", "KEN", "LSO", "LBR", "MDG", 
                           "MWI", "MLI", "MRT", "MUS", "MOZ", "NAM", "NER", "NGA", "RWA", 
                           "STP", "SEN", "SYC", "SLE", "SOM", "ZAF", "SSD", "TZA", "TGO", 
                           "UGA", "ZMB", "ZWE")

population_data <- WDI(country = sub_saharan_countries, indicator = "SP.POP.1564.TO", start = 2021, end = 2021)
#> Warning in open.connection(con, "rb"): URL
#> 'https://api.worldbank.org/v2/en/country/AGO;BEN;BWA;BFA;BDI;CPV;CMR;CAF;TCD;COM;COG;CIV;COD;DJI;GNQ;ERI;SWZ;ETH;GAB;GMB;GHA;GIN;GNB;KEN;LSO;LBR;MDG;MWI;MLI;MRT;MUS;MOZ;NAM;NER;NGA;RWA;STP;SEN;SYC;SLE;SOM;ZAF;SSD;TZA;TGO;UGA;ZMB;ZWE/indicator/SP.POP.1564.TO?format=json&date=2021:2021&per_page=32500&page=9':
#> Timeout of 60 seconds was reached

head(population_data,10) %>% kable
country iso2c iso3c year SP.POP.1564.TO
Angola AO AGO 2021 18020088
Burundi BI BDI 2021 6430031
Benin BJ BEN 2021 7063242
Burkina Faso BF BFA 2021 11793330
Botswana BW BWA 2021 1643461
Central African Republic CF CAF 2021 2691324
Cote d’Ivoire CI CIV 2021 15332583
Cameroon CM CMR 2021 14923302
Congo, Dem. Rep. CD COD 2021 48438607
Congo, Rep. CG COG 2021 3263441

To query the data of average Facebook users by country, we use rSocialWatcher to pull the data from the Facebook Marketplace API. The data is broken down into estimates of daily active users (DAU) and lower and upper bounds for monthly active users (MAU).

AF_data <- query_fb_marketing_api(
      location_unit_type = "country",
      location_keys      = map_param_vec(population_data$iso2c),
      version            = VERSION,
      creation_act       = CREATION_ACT,
      token              = TOKEN)

To determine the percentage of the adult population on Facebook, the mean of the lower and upper bounds of the monthly average users is divided by the adult population data obtained from the WDI. Due to Facebook’s age restrictions, only data for populations aged 18 and above is queried.

AF_data_filtered <- AF_data %>%
  inner_join(population_data, by = c("location_keys" = "iso2c")) %>%
  mutate(percent_pop = (estimate_mau_lower_bound + estimate_mau_upper_bound)/2 / SP.POP.1564.TO)

africa_sf <- ne_countries(continent = "Africa", returnclass = "sf")

africa_sf <- africa_sf %>%
  inner_join(AF_data_filtered, by = c("iso_a3" = "iso3c"))

Percentage of Facebook Users by Country

Below, the graph illustrates the percentage of the adult Facebook users across various Sub-Saharan Africa countries. Some notable observations include:

  • While Uganda and the Democratic Republic of Congo have relatively high populations, they have very low Facebook usage.
  • Central Africa in general has relatively low Facebook usage.
  • While certain countries in South and West Africa have high Facebook Usage - such as Botswana, Namibia, and Mauritania - they have very low total populations.
africa_sf$prcnt_p <- africa_sf$prcnt_p * 100

palette1 <- colorNumeric(
  palette = "YlOrRd", 
  domain = africa_sf$prcnt_p
)
palette2 <- colorNumeric(
  palette = "YlOrRd", 
  domain = africa_sf$SP_POP_
)


leaflet(height = 500) %>%
  addTiles() %>%
  addPolygons(data = africa_sf,
              popup = ~paste("<strong>Country:</strong> ", country, "<br>",
                             "<strong>Facebook Percentage:</strong> ", paste(round(prcnt_p, 2),"%", sep=""), "<br>",
                             "<strong>Adult Population:</strong> ", round(SP_POP_/1000000, 2), " million"),
              fillColor = ~palette2(SP_POP_),
              color = "black",
              weight = 1,
              opacity = 1,
              fillOpacity = 0.7, 
              group = "Adult Population") %>%
  addPolygons(data = africa_sf,
              popup = ~paste("<strong>Country:</strong> ", country, "<br>",
                             "<strong>Facebook Percentage:</strong> ", paste(round(prcnt_p, 2),"%", sep=""), "<br>",
                             "<strong>Adult Population:</strong> ", round(SP_POP_/1000000, 2), " million"),
              fillColor = ~palette1(prcnt_p),
              color = "black",
              weight = 1,
              opacity = 1,
              fillOpacity = 0.7, 
              group = "Facebook Percentage") %>%
  addLayersControl(
    baseGroups = c("Adult Population", "Facebook Percentage"),
    options = layersControlOptions(collapsed = FALSE, autoZIndex = FALSE)
  ) %>%
  addLegend(position = "bottomright", pal = palette2, values = africa_sf$SP_POP_,
            title = "Adult Population (millions)", className = "custom-legend legend1") %>%
  addLegend(position = "bottomright", pal = palette1, values = africa_sf$prcnt_p,
            title = "Facebook Percentage", className = "custom-legend legend2") %>%
  onRender("
    function(el, x) {
      var map = this;
      var legend1 = document.querySelector('.legend1');
      var legend2 = document.querySelector('.legend2');
      
      function updateLegends(e) {
        if (e.name === 'Adult Population') {
          legend1.style.display = 'block';
          legend2.style.display = 'none';
        } else if (e.name === 'Facebook Percentage') {
          legend1.style.display = 'none';
          legend2.style.display = 'block';
        }
      }
      
      map.on('baselayerchange', updateLegends);
      
      legend1.style.display = 'block';
      legend2.style.display = 'none';
    }
  ") %>%
  tagList(
    tags$style(HTML("
      .custom-legend {
        background-color: white;
        padding: 10px;
        border: 1px solid black;
        border-radius: 5px;
        box-shadow: 2px 2px 6px rgba(0,0,0,0.3);
      }
    "))
  )

Determining Difference in Urban/Rural Facebook Users in Nigeria

We needed enough regions to have statistically significant results and Facebook did not include subregion data for most of the countries. Nigeria is selected for three reasons:

  1. It has a large adult population at around 115 million.
  2. It has a relatively high percentage of people on Facebook (~40%)
  3. It has enough regions to support analysis. Most countries in Subsaharan support region level Facebook data but not sub-region level data.

Therefore, we need a country that has enough regions to support analysis. For example, South Africa only has 9 regions, which is not enough for useful analysis.

The names are adjusted to match the syntax and spelling of WDI data.

NG_fb_df <- get_fb_parameter_ids(type = "region", 
                               country_code = "NG",
                               version = VERSION, 
                               token = TOKEN)

NG_fb_df$name <- gsub("\\b state\\b", "", NG_fb_df$name, ignore.case = TRUE)
NG_fb_df$name <- str_replace_all(NG_fb_df$name, regex("Benué", ignore_case = TRUE), "Benue")
NG_fb_df$name <- str_replace_all(NG_fb_df$name, regex("Nasarawa", ignore_case = TRUE), "Nassarawa")

Determine Population of States

To obtain population data for each individual region, we use population raster data from WorldPop. Using a 1km resolution map and aggregate the data by region, we describe region population with multiple metric. This detailed approach allows for a precise and comprehensive analysis of population distribution and density across different regions.

population_raster <- raster("worldpop_data/nga_ppp_2020.tif")
population_df <- as.data.frame(population_raster, xy=TRUE)


population_df <- population_df[!is.na(population_df$nga_ppp_2020), ]
population_df$log_population <- log1p(population_df$nga_ppp_2020)

To validate our data, we extract the coordinates of locations in Nigeria with populations exceeding 1 million people and plot them on a map.

world_cities <- ne_download(scale = "medium", type = "populated_places", category = "cultural", returnclass = "sf")
#> Reading layer `ne_50m_populated_places' from data source 
#>   `/private/var/folders/pl/z72m_lzx4_d8qgb1jjgg6tgc0000gn/T/RtmpWFRKNr/ne_50m_populated_places.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 1251 features and 137 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -175.2206 ymin: -90 xmax: 179.2166 ymax: 78.22097
#> Geodetic CRS:  WGS 84
nigeria_cities <- world_cities %>% filter(ADM0NAME == "Nigeria")
largest_nigeria_cities <- nigeria_cities %>% filter(POP_MAX > 1000000)

nigeria <- ne_states(country = "Nigeria", returnclass = "sf")

largest_nigeria_cities %>% 
  dplyr::select(NAME, POP_MAX, ADM1NAME) %>% 
  kable
NAME POP_MAX ADM1NAME geometry
Benin City 1190000 Edo POINT (5.618062 6.342423)
Port Harcourt 1020000 Rivers POINT (7.008055 4.811948)
Ibadan 2628000 Oyo POINT (3.928036 7.381972)
Kaduna 1442000 Kaduna POINT (7.438054 10.52196)
Abuja 1576000 Federal Capital Territory POINT (7.489505 9.05462)
Kano 3140000 Kano POINT (8.518092 12.00192)
Lagos 9466000 Lagos POINT (3.389585 6.445208)
ggplot() +
  geom_raster(data = population_df, mapping = aes(x = x, y = y, fill = log_population)) +
  scale_fill_gradient(low = "darkblue", high = "lightblue", na.value = "transparent") +
  labs(title = "Population Density (1km) with Large Cities",
       fill = "Log(Population)") +
  geom_sf(data = nigeria, fill = NA, color = "black", linewidth = 0.5) +
  geom_sf(data = largest_nigeria_cities, aes(geometry = geometry), fill = "white", color = "white", size = 4, shape = 2, stroke = 1.2) +
  geom_text(data = largest_nigeria_cities, aes(x = st_coordinates(geometry)[,1], y = st_coordinates(geometry)[,2], label = NAME), 
            size = 3, nudge_y = 0.54, nudge_x = 0.1, color = "white") +
  theme_minimal() +
  theme(
        panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        plot.background = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
    )

To determine which regions to classify as urban, we evaluate several metrics for each region:

  • Total Population: The aggregate number of people in the region.
  • Mean Population: Equivalent to population density, representing the average population per 1km².
  • 95th Percentile Population: The population count at the 95th percentile of 1km blocks, indicating high-density areas.
  • Log Total Population: The logarithmic scale of the total population, which helps in analyzing large numbers more effectively.
  • Log Mean Population: The logarithmic scale of the mean population, aiding in better understanding of population density.
  • Log 95th Percentile Population: The logarithmic scale of the 95th percentile population, providing insight into the distribution of high-density areas.
percentile_95 <- function(values, coverage_fractions) {
  quantile(values, probs = 0.99, na.rm = TRUE)
}

nigeria_regions <- st_transform(nigeria, crs(population_raster))

total_population <- exact_extract(population_raster, nigeria_regions, fun = 'sum', progress = F)
mean_population <- exact_extract(population_raster, nigeria_regions, fun = 'mean', progress = F)
percentile_95_values <- exact_extract(population_raster, nigeria_regions, fun = percentile_95, progress = F)

nigeria_regions$pop_percentile <- percentile_95_values
nigeria_regions$total_population <- total_population
nigeria_regions$mean_population <- mean_population
nigeria_regions$log_pop_percentile = log1p(nigeria_regions$pop_percentile)
nigeria_regions$log_total_population = log1p(nigeria_regions$total_population)
nigeria_regions$log_mean_population = log1p(nigeria_regions$mean_population)

nigeria_filtered_regions <- nigeria_regions %>%
  dplyr::select(name, pop_percentile, total_population, mean_population, log_pop_percentile, log_total_population, log_mean_population)
generate_continuous_legend_html <- function(title, palette, values) {
  bins <- pretty(values, n = 7)
  colors <- palette(bins)
  labels <- paste(
    "<i style='background:", colors, "; width: 18px; height: 18px; display: inline-block;'></i>",
    ifelse(bins == min(bins), "Low", ifelse(bins == max(bins), "High", "")),
    collapse = "<br>"
  )
  paste("<h4>", title, "</h4>", labels)
}

ttl_ppl_range <- range(nigeria_filtered_regions$ttl_ppl, na.rm = TRUE)
mn_pplt_range <- range(nigeria_filtered_regions$mn_pplt, na.rm = TRUE)
pp_prcn_range <- range(nigeria_filtered_regions$pp_prcn, na.rm = TRUE)
lg_ttl_range <- range(nigeria_filtered_regions$lg_ttl_, na.rm = TRUE)
lg_mn_p_range <- range(nigeria_filtered_regions$lg_mn_p, na.rm = TRUE)
lg_pp_p_range <- range(nigeria_filtered_regions$lg_pp_p, na.rm = TRUE)

pop_palette <- colorNumeric("YlOrRd", ttl_ppl_range, na.color = "transparent")
mean_palette <- colorNumeric("YlOrRd", mn_pplt_range, na.color = "transparent")
percentile_palette <- colorNumeric("YlOrRd", pp_prcn_range, na.color = "transparent")
log_pop_palette <- colorNumeric("YlOrRd", lg_ttl_range, na.color = "transparent")
log_mean_palette <- colorNumeric("YlOrRd", lg_mn_p_range, na.color = "transparent")
log_percentile_palette <- colorNumeric("YlOrRd", lg_pp_p_range, na.color = "transparent")

legend_html <- generate_continuous_legend_html("Value", pop_palette, ttl_ppl_range)
#> Warning in palette(bins): Some values were outside the color scale and will be
#> treated as NA

combined_map <- leaflet(height = 700, width = 900) %>%
  addTiles() %>%
  addPolygons(data = nigeria_filtered_regions,
              fillColor = ~pop_palette(ttl_ppl),
              color = "black", weight = 1, opacity = 1, fillOpacity = 0.7,
              popup = ~paste("<strong>Region:</strong> ", name, "<br>",
                             "<strong>Metric:</strong> ", round(ttl_ppl, 2)), 
              group = "Population") %>%
  addPolygons(data = nigeria_filtered_regions,
              fillColor = ~mean_palette(mn_pplt),
              color = "black", weight = 1, opacity = 1, fillOpacity = 0.7,
              popup = ~paste("<strong>Region:</strong> ", name, "<br>",
                             "<strong>Metric:</strong> ", round(mn_pplt, 2)), 
              group = "Mean Pop") %>%
  addPolygons(data = nigeria_filtered_regions,
              fillColor = ~percentile_palette(pp_prcn),
              color = "black", weight = 1, opacity = 1, fillOpacity = 0.7,
              popup = ~paste("<strong>Region:</strong> ", name, "<br>",
                             "<strong>Metric:</strong> ", round(pp_prcn, 2)), 
              group = "95% Pop") %>%
  addPolygons(data = nigeria_filtered_regions,
              fillColor = ~log_pop_palette(lg_ttl_),
              color = "black", weight = 1, opacity = 1, fillOpacity = 0.7,
              popup = ~paste("<strong>Region:</strong> ", name, "<br>",
                             "<strong>Metric:</strong> ", round(lg_ttl_, 2)), 
              group = "Pop - Log") %>%
  addPolygons(data = nigeria_filtered_regions,
              fillColor = ~log_mean_palette(lg_mn_p),
              color = "black", weight = 1, opacity = 1, fillOpacity = 0.7,
              popup = ~paste("<strong>Region:</strong> ", name, "<br>",
                             "<strong>Metric:</strong> ", round(lg_mn_p, 2)), 
              group = "Mean Pop - Log") %>%
  addPolygons(data = nigeria_filtered_regions,
              fillColor = ~log_percentile_palette(lg_pp_p),
              color = "black", weight = 1, opacity = 1, fillOpacity = 0.7,
              popup = ~paste("<strong>Region:</strong> ", name, "<br>",
                             "<strong>Metric:</strong> ", round(lg_pp_p, 2)), 
              group = "95% Pop - Log") %>%
  addLayersControl(
    baseGroups = c("Population", "Mean Pop", "95% Pop",
                   "Pop - Log", "Mean Pop - Log", 
                   "95% Pop - Log"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  addControl(html = legend_html, position = "bottomright") %>%
  htmlwidgets::onRender("
    function(el, x) {
      var map = this;
      function updateLegend(e) {
        var layer = e.name;
        var legendHTML = document.querySelector('.custom-legend');
        legendHTML.innerHTML = legend_htmls[layer];
      }
      map.on('baselayerchange', updateLegend);
      // Initialize legend display
      updateLegend({name: 'Population'});
    }
  ") %>%
  tagList(
    tags$style(HTML("
      .custom-legend {
        background-color: white;
        padding: 10px;
        border: 1px solid black;
        border-radius: 5px;
        box-shadow: 2px 2px 6px rgba(0,0,0,0.3);
        font-size: 12px;
        line-height: 18px;
      }
      .leaflet-bottom.leaflet-right .custom-legend {
        margin-bottom: 20px;
      }
    "))
  )

combined_map

The log of the 95th percentile produces the most interesting and useful results for clustering. Applying the log transformations is necessary because Lagos is much more densely and highly populated than the rest of Nigeria. This transformation helps normalize the data, making it easier to identify and analyze patterns in population distribution across different regions.

ggplot(nigeria_filtered_regions) +
    geom_sf(aes(fill = lg_pp_p)) +
    scale_fill_distiller(palette = "YlOrRd", direction = 1) + 
    labs(title = "95% Population (1km) - Log Scale",
         fill = "Population") +
    theme_minimal() + 
    theme(
        panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        plot.background = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
    )

Classify as Urban and Rural (Clustering)

To classify regions as urban or rural, we use k-means clustering. The elbow plot shows a clear bend at two clusters, supporting this methodology. This approach allows us to effectively differentiate between urban and rural areas based on population metrics, ensuring accurate and meaningful classification.

metric <- nigeria_filtered_regions$lg_pp_p

#metric <- nigeria_filtered_regions$log_pop_percentile
wss <- sapply(1:10, function(k){
  kmeans(metric, centers = k, nstart = 25)$tot.withinss
})

wss_df <- data.frame(
  clusters = 1:10,
  wss = wss
)

ggplot(wss_df, aes(x = clusters, y = wss)) +
  geom_point() +
  geom_line() +
  labs(title = "Elbow Method for Determining Optimal Number of Clusters",
       x = "Number of Clusters",
       y = "Total Within-Cluster Sum of Squares (WSS)") +
  theme_minimal() + 
      theme(
        plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
    )

After applying k-means clustering with two clusters, the regions are categorized as follows. The results align with expectations: regions containing the largest cities (Lagos, Abuja, Kano) are classified as urban, along with the more densely populated coastal areas. An exception is Ibadan. Although Ibadan is one of the largest cities in Nigeria, the Oyo region is not densely populated, thus it falls outside the urban classification.

set.seed(42)

kmeans_result <- kmeans(metric, centers = 2, nstart = 25)
nigeria_filtered_regions$Cluster <- as.factor(kmeans_result$cluster)
nigeria_filtered_regions_sf <- st_as_sf(nigeria_filtered_regions)

yl_or_rd_colors <- brewer.pal(3, "YlOrRd")[1:2]

cluster_labels <- c("1" = "Urban", "2" = "Rural")

ggplot(nigeria_filtered_regions_sf) +
  geom_sf(aes(fill = Cluster)) +
  labs(title = "K-means Clustering on Log Pop Population",
       fill = "Cluster") +
  geom_sf(data = largest_nigeria_cities, aes(geometry = geometry), 
          fill = "black", color = "black", size = 4, shape = 2, stroke = 1.2) +
  geom_text(data = largest_nigeria_cities, 
            aes(x = st_coordinates(geometry)[,1], 
                y = st_coordinates(geometry)[,2], 
                label = NAME), 
            size = 3, nudge_y = 0.5, nudge_x = 0.1, color = "black") +
  scale_fill_manual(values = yl_or_rd_colors, labels = cluster_labels) + 
  theme_minimal() + 
  theme(
    panel.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.line = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    plot.background = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
  )

For reference, the distribution of regions based on their log 95th percentile population is displayed, showing how the regions are grouped. Despite the log transformation, Lagos remains distinctly separate from other regions, indicating its uniquely high population density.

plot_data <- data.frame(
  x = nigeria_filtered_regions$lg_pp_p,
  cluster = nigeria_filtered_regions$Cluster
)

custom_colors <- c("blue", "orange")

ggplot(plot_data, aes(x = x, y = 0, color = cluster)) +
  geom_point(size = 4) +
  annotate("segment", x = min(plot_data$x), xend = max(plot_data$x), y = 0, yend = 0, size = 1) +
  annotate("segment", x = min(plot_data$x), xend = min(plot_data$x), y = -0.1, yend = 0.1, size = 1) +
  annotate("segment", x = max(plot_data$x), xend = max(plot_data$x), y = -0.1, yend = 0.1, size = 1) +
  geom_text(aes(label = round(x, 1)), col = "white", vjust = -1.5, size = 3) +
  scale_x_continuous(limits = range(plot_data$x)) +
  scale_y_continuous(limits = c(-1, 1)) +
  scale_color_manual(values = custom_colors) + 
  theme(panel.background = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        legend.position = "none") +
  labs(title = "K-means Clustering on Total Population")
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

Querying Facebook Interests Data

We select the following behaviors and interests to study as selected by Rama et Al (2020).

Selected Behaviors

  • Frequent Travelers
  • Frequent international travelers
  • Technology early adopters
  • Facebook access (mobile): Apple (iOS) devices
  • Lives abroad
  • Facebook access (network type): 4G
  • Facebook access (network type): 3G
  • Facebook access (mobile): Android devices

Selected Interests

  • Gambling (gambling)
  • Concerts (music event)
  • Cooking (food & drink)
  • Fast food restaurants (dining)
  • Restaurants (dining)
  • Politics (politics)

For each of the selected interests and behaviors, we need to acquire the correct keys to pull the accurate data. This ensures that our queries retrieve the specific information required for our analysis.

interests_df <- get_fb_parameter_ids("interests", VERSION, TOKEN)
#> No encoding supplied: defaulting to UTF-8.
#> Warning in get_fb_parameter_ids("interests", VERSION, TOKEN): Error code: 190
#> Warning in get_fb_parameter_ids("interests", VERSION, TOKEN): Error validating
#> access token: Session has expired on Friday, 05-Jul-24 06:57:04 PDT. The
#> current time is Friday, 02-Aug-24 05:27:46 PDT.
behaviors_df <- get_fb_parameter_ids("behaviors", VERSION, TOKEN)
#> No encoding supplied: defaulting to UTF-8.
#> Warning in get_fb_parameter_ids("behaviors", VERSION, TOKEN): Error code: 190
#> Warning in get_fb_parameter_ids("behaviors", VERSION, TOKEN): Error validating
#> access token: Session has expired on Friday, 05-Jul-24 06:57:04 PDT. The
#> current time is Friday, 02-Aug-24 05:27:46 PDT.

selected_behaviors <- c("Frequent Travelers",
                        "Frequent international travelers",
                        "Technology early adopters",
                        "Facebook access (mobile): Apple (iOS) devices",
                        "Lives abroad",
                        "Facebook access (network type): 4G",
                        "Facebook access (network type): 3G",
                        "Facebook access (mobile): Android devices")

selected_interests <- c("Gambling (gambling)",
                        #"Organic food (food & drink)",
                        "Concerts (music event)",
                        "Fast food restaurants (dining)",
                        #"Tattoos (body art)",
                        #"Reading (communication)",
                        "Restaurants (dining)",
                        #"Fitness and wellness (fitness)",
                        "Cooking (food & drink)",
                        #"Vegetarianism (diets)",
                        #"Jazz music (music)",
                        #"Hunting (sport)",
                        #"Fine art (visual art)",
                        "Politics (politics)")
behaviors_id <- behaviors_df %>%
  filter(name %in% selected_behaviors) %>%
  pull(id)

interests_id <- interests_df %>%
  filter(name %in% selected_interests) %>%
  pull(id)

We query each behavior and interests across the regions and save the data locally for further analysis and reference. Additionally, we also study gender data for each region. Gender is analyzed separately from behavior and interests because it was included as a distinct variable in the original study.

## Behaviors and Interests
for (behavior in behaviors_id) {
  fb_df <- query_fb_marketing_api(
    location_unit_type = "region",
    location_keys      = map_param_vec(NG_fb_df$key),
    behaviors          = behavior,
    version            = VERSION,
    creation_act       = CREATION_ACT,
    token              = TOKEN
  )
  
  write.csv(fb_df, file = paste0("data/nigeria/", behavior, ".csv"))
  
  print(paste("Done with", behavior))
}

for (interest in interests_id) {
  fb_df <- query_fb_marketing_api(
      location_unit_type = "region",
      location_keys      = map_param_vec(NG_fb_df$key),
      interests          = interest,
      version            = c(VERSION, VERSION1),
      creation_act       = c(CREATION_ACT, CREATION_ACT1),
      token              = c(TOKEN, TOKEN1)
  )
  
  write.csv(fb_df, file = paste0("data/nigeria/", interest, ".csv"))
  
  print(paste("Done with", interest))
}
## Gender
fb_df <- query_fb_marketing_api(
    location_unit_type = "region",
    location_keys      = map_param_vec(NG_fb_df$key),
    gender             = map_param(1, 2),
    version            = VERSION,
    creation_act       = CREATION_ACT,
    token              = TOKEN
)

male <- fb_df %>%
  filter(gender == 1)
write.csv(male, file = paste0("data/nigeria/male.csv"))

female <- fb_df %>%
  filter(gender == 2)
write.csv(female, file = paste0("data/nigeria/female.csv"))

To compare the percentage of Facebook users in each region associated with a behavior, gender, or interest, we need to query the region-level Facebook population data. This data will provide the necessary baseline for calculating and comparing user engagement across different demographics and interests.

NG_fb_region_data <- query_fb_marketing_api(
    location_unit_type = "region",
    location_keys      = map_param_vec(NG_fb_df$key),
    version            = VERSION,
    creation_act       = CREATION_ACT, 
    token              = TOKEN)

NG_fb_region_data = NG_fb_region_data %>%
                    mutate(users = (estimate_mau_lower_bound + estimate_mau_upper_bound)/2) %>%
                    dplyr::select(location_keys, users) %>%
                    mutate(key = as.integer(location_keys))

We then conduct our statistical tests for each interest, behavior, and gender. Specifically, we use a t-test to compare the average population of Facebook users between urban and rural regions. The results, including the difference in means and the 95% confidence interval, are saved for further analysis and interpretation. Furthermore, the means and variances of rural, urban, and all regions is saved for each behavior.

nigeria_result_df <- data.frame(Difference_in_Means = numeric(), Lower_CI = numeric(), Upper_CI = numeric(), behavior = character())
means_df <- data.frame(Behavior = character(), Mean = numeric(), Variance = numeric(), Group1_Mean = numeric(), Group1_Var = numeric(), Group2_Mean = numeric(), Group2_Var = numeric())

process_behavior <- function(x, y) {
  data <- read.csv(paste0("data/nigeria/", y, ".csv"))
  
  merged_data <- NG_fb_df %>%
    mutate(key = as.integer(key)) %>%
    left_join(nigeria_filtered_regions, by = "name") %>%
    left_join(data, by = c("key" = "location_keys")) %>%
    left_join(NG_fb_region_data, "key") %>%
    mutate(average_pop = ((estimate_mau_lower_bound + estimate_mau_upper_bound))/2/users) %>%
    dplyr::select(key, name, geometry, Cluster, average_pop)
  
  overall_mean <- mean(merged_data$average_pop, na.rm = TRUE)
  variance <- var(merged_data$average_pop, na.rm = TRUE)
  
  group1 <- merged_data$average_pop[merged_data$Cluster == 1]
  group2 <- merged_data$average_pop[merged_data$Cluster == 2]
  
  means_df <<- rbind(means_df, data.frame(
      Behavior = x,
      Mean = overall_mean,
      Variance = variance,
      Group1_Mean = mean(group1, na.rm = TRUE),
      Group1_Var = var(group1, na.rm = TRUE),
      Group2_Mean = mean(group2, na.rm = TRUE),
      Group2_Var = var(group2, na.rm = TRUE)
  ))
    
  t_test_result <- t.test(group1, group2)
  
  diff_means <- t_test_result$estimate[1] - t_test_result$estimate[2]  # Difference in means
  conf_interval <- t_test_result$conf.int
  
  nigeria_result_df <<- rbind(nigeria_result_df, data.frame(
      Difference_in_Means = diff_means,
      Lower_CI = conf_interval[1],
      Upper_CI = conf_interval[2],
      behavior = x
  ))
}

mapply(process_behavior, c(selected_behaviors, selected_interests, "Male", "Female"), c(behaviors_id, interests_id, "male", "female"))

nigeria_result_df <- nigeria_result_df %>%
  arrange(desc(Difference_in_Means)) %>%
  mutate(behavior = factor(behavior, levels = unique(behavior)))

Results and Discussion

Below, we visualize our results. The right side of the graph represents more urban regions, while the left side corresponds to more rural areas.

Most of these results are as expected. Interests such as Restaurants, Politics, Lives Abroad, Fast Food, and iOS tend to be more prevalent in urban areas, which aligns with common assumptions. However, some findings are quite surprising, such as the higher prevalence of Frequent International Travelers and Early Technology Adopters in rural areas. According to Rama et al. (2020), Early Technology Adopters, while generally leaning towards urban areas, are not exclusively urban, with the interquartile range spanning both urban and rural regions.

nigeria_result_df <- nigeria_result_df %>%
  arrange(desc(Difference_in_Means)) %>%
  mutate(behavior = factor(behavior, levels = unique(behavior)))

ggplot(nigeria_result_df, aes(x = behavior, y = Difference_in_Means, ymin = Lower_CI, ymax = Upper_CI)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  coord_flip() +
  theme_minimal(base_size = 14) +  
  labs(title = "Differences in Means Between Urban and Rural Regions of Nigeria", x = NULL, y = "Difference in Means (More Rural - More Urban)") + 
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.text = element_text(size = 14),
    axis.title = element_text(size = 16),
    legend.title = element_text(size = 16),
    legend.text = element_text(size = 18)
  )

For reference, here are the overall means and the ranges within one standard deviation for each behavior across all regions. The graphs display the average values of each behavior, with error bars representing the variability. One observation, for example, is that while there is a noticeable bias towards males in rural areas and females in urban areas, both are still male dominant This indicates that rural areas have an overwhelming male presence on Facebook, whereas urban areas show a more balanced, yet still male-dominant, representation.

means_df$Behavior <- str_wrap(means_df$Behavior, width =40)

means_df <- means_df %>%
  arrange(Mean) %>%
  mutate(Behavior = factor(Behavior, levels = unique(Behavior)))

means_long <- means_df %>%
  dplyr::select(Behavior, Mean, Variance, Group1_Mean, Group1_Var, Group2_Mean, Group2_Var) %>%
  gather(key = "Group", value = "Value", -Behavior, -Variance, -Group1_Var, -Group2_Var) %>%
  mutate(Group_Var = case_when(
    Group == "Mean" ~ Variance,
    Group == "Group1_Mean" ~ Group1_Var,
    Group == "Group2_Mean" ~ Group2_Var
  ),
  Group = recode(Group, 
                 "Mean" = "Overall",
                 "Group1_Mean" = "Urban",
                 "Group2_Mean" = "Rural")
  )

ggplot(means_long, aes(y = Behavior, x = Value * 100, fill = Group)) +
  geom_bar(stat = "identity", position = position_dodge(0.8), width = 0.8) +
  geom_errorbar(aes(xmin = (Value - sqrt(Group_Var)) * 100, xmax = (Value + sqrt(Group_Var)) * 100), 
                position = position_dodge(0.8), width = 0.3) +
  scale_fill_manual(values = c("Overall" = "grey", "Urban" = "blue", "Rural" = "red")) +
  theme(axis.text.y = element_text(angle = 0, hjust = 1)) +
  labs(title = "Overall Mean with Variance and Cluster Means for Each Behavior", x = "Mean (%)", y = NULL) + 
  scale_x_continuous(labels = scales::percent_format(scale = 1)) +
  theme(
    axis.text.y = element_text(angle = 0, hjust = 1, vjust = 1),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.text = element_text(size = 12),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10)
  )

## Average Percent Male Users:

- Urban Areas: 54.44%
- Rural Areas: 65.85%

Conclusion

This vignette demonstrates how private data, specifically Facebook Marketplace data, can be leveraged where public data of communities is lacking or nonexistent. This is particularly pertinent for rural communities where data has historically been inaccurate, incomplete, or lacking. Using rSocialWatcher, we were able to query specific population and interest data at a region level using Facebook Marketplace data. By combining this with existing public data, we were able to derive insights regarding rural and urban differences in Nigeria. We hope to see how you utilize this package in the future for your social science projects.